Note: for all answers to questions, place all of your code in “chunks” (provided below) and submit a completed html document to web campus by “knitting” your Rmarkdown document. Insert your name and date in the information at the top!!

Question 1 (2 pts)

  1. read in the file “mydata.csv” using either read.csv() or read.table() and B) display the dataframe.
setwd("/Users/griffinshelor/Documents/Schoolwork/UNR/GEOG616/Labs/Lab2")
mydata <- read.csv("mydata.csv")
  1. plot cost versus square footage.
plot(mydata$sqft, mydata$cost)

Question 2 (1 pts)

Using the dataframe you loaded in Question 1 - Use the hist() function to plot a histogram of costs with a title of “Cost Distribution”, and an x-axis label of “Cost ($)”

hist(mydata$cost, main = "Cost Distribution", xlab = "Cost ($)")

Question 3 (2 pts)

Using the dataframe you loaded in Question 1 - Create a linear model looking at Cost as a function of Square Footage. Summarize your model and describe what you found.

mydata_lm <- lm(cost ~ sqft, data = mydata)
plot(mydata_lm)

summary(mydata_lm)
## 
## Call:
## lm(formula = cost ~ sqft, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -92.499 -19.302   2.406  28.019  80.607 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  70.7504    60.3477   1.172   0.2621  
## sqft          0.1878     0.0664   2.828   0.0142 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.18 on 13 degrees of freedom
## Multiple R-squared:  0.3809, Adjusted R-squared:  0.3333 
## F-statistic: 7.997 on 1 and 13 DF,  p-value: 0.01425
## The minimum residual (-92.5) lies farther from the median (2.4) than the maximum residual (80.6), which suggests a slight left skew: a few buildings cost noticeably less than the model predicts. The Q-Q residual plot shows most points near the straight line, with the largest departures in the lower tail, i.e. among the most negative residuals. The intercept estimate (70.75) has a large standard error (60.35) and a p-value of 0.26, so it is not statistically distinguishable from 0. The t and p values for the sqft coefficient (0.1878, p = 0.014) indicate that it is unlikely to be 0, so there is a statistically significant positive relationship between sqft and cost: each additional square foot is associated with roughly $0.19 more in cost. The model explains only about 38% of the variance in cost (multiple R-squared = 0.38).
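As a quick check on how the coefficient table fits together, the sketch below recomputes the sqft t value and p-value from the rounded estimates printed in the summary and uses the fitted line to predict cost for a 1000 square foot building. The 1000 square foot input is an illustrative assumption, and because the inputs are rounded, the results will match the summary only approximately.

```r
# Values copied from the summary(mydata_lm) output (rounded as printed)
intercept <- 70.7504
slope     <- 0.1878
slope_se  <- 0.0664
resid_df  <- 13

# The t value is the estimate divided by its standard error
t_value <- slope / slope_se

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_value <- 2 * pt(-abs(t_value), df = resid_df)

# Predicted cost for a hypothetical 1000 square foot building
predicted_cost <- intercept + slope * 1000

round(c(t = t_value, p = p_value, predicted = predicted_cost), 4)
```

With the fitted model object available, `predict(mydata_lm, newdata = data.frame(sqft = 1000))` would give the same prediction without retyping the coefficients.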

Question 4 (4 pts)

From the graph in Question 1C it looks like there is a possible increase in cost in buildings that are larger than 1000 square feet.

  1. Create a new column in your dataset with 2 factors (categories) of buildings with low (<1000) and high (>= 1000) square footage.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
mydata <- mydata |>
  mutate(building_size = factor(if_else(sqft < 1000, "low", "high"),
                                levels = c("low", "high")))
  1. conduct an analysis using a linear model to explore whether there is a difference in cost between the high and low square footage buildings, display and describe your result.
size_lm <- lm(cost ~ building_size, data = mydata)
plot(size_lm)

summary(size_lm)
## 
## Call:
## lm(formula = cost ~ building_size, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -100.773  -24.973    1.527   24.777   70.500 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         213.47      13.07  16.329 4.82e-10 ***
## building_sizehigh    91.03      25.32   3.596  0.00326 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.36 on 13 degrees of freedom
## Multiple R-squared:  0.4986, Adjusted R-squared:  0.4601 
## F-statistic: 12.93 on 1 and 13 DF,  p-value: 0.003259
## The building_sizehigh coefficient (91.03, p = 0.003) indicates a statistically significant difference in cost: buildings of 1000 square feet or more cost about 91 more on average than smaller buildings. The intercept (213.47) is the mean cost of the low group, and its small p-value only shows that this mean differs from 0. The minimum residual is farther from the median than the maximum residual, which suggests a longer lower tail and possible skew in the residuals. The Q-Q residual plot shows most points near the straight line, so the residuals are roughly normal, but there is again an obvious outlier in the lower left corner of the plot. That outlier inflates the residual variance and may partly explain why the adjusted R^2 is only 0.46 rather than a higher value.
  1. use a linear model to analyze the cost difference as a function of both square footage, and your new size category, display and describe your result.
sqft_size_lm <- lm(cost ~ sqft + building_size, data = mydata)
plot(sqft_size_lm)

summary(sqft_size_lm)
## 
## Call:
## lm(formula = cost ~ sqft + building_size, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -98.074 -21.129  -2.127  26.385  68.962 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)       186.70682   88.04107   2.121   0.0555 .
## sqft                0.03361    0.10925   0.308   0.7636  
## building_sizehigh  79.29675   46.28620   1.713   0.1124  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.95 on 12 degrees of freedom
## Multiple R-squared:  0.5025, Adjusted R-squared:  0.4196 
## F-statistic: 6.061 on 2 and 12 DF,  p-value: 0.01515
## Neither the sqft coefficient (p = 0.76) nor the building_size coefficient (p = 0.11) is statistically significant on its own, even though the overall F-test is (p = 0.015). The intercept comes closest to the 0.05 threshold (p = 0.056). This is likely because sqft and building_size describe the same attribute in different ways, so the two predictors are strongly collinear. Adding one of them when the other is already in the model contributes little new information and inflates the standard errors of both coefficients: multiple R-squared barely changes from the building_size-only model (0.5025 vs 0.4986) and adjusted R-squared drops (0.4196 vs 0.4601). The Q-Q residuals plot is very similar to the same plot from the previous models, so the distribution of the residuals has not improved either.
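The overlap between the two predictors can be quantified with a variance inflation factor (VIF). The sketch below uses simulated data, since mydata.csv is not reproduced here: the data frame `sim` and all of its values are hypothetical, and only the way `building_size` is derived from `sqft` mirrors the lab.

```r
# Hypothetical data, simulated only to illustrate the idea; not the lab's mydata.csv
set.seed(1)
sim <- data.frame(sqft = round(runif(15, min = 500, max = 1500)))
sim$building_size <- factor(ifelse(sim$sqft < 1000, "low", "high"),
                            levels = c("low", "high"))

# R-squared from regressing one predictor on the other
r2_predictors <- summary(lm(sqft ~ building_size, data = sim))$r.squared

# With exactly two predictors, both share the same VIF = 1 / (1 - R^2)
vif <- 1 / (1 - r2_predictors)

c(r_squared = r2_predictors, vif = vif)
```

A VIF near 1 means the predictors carry mostly separate information; larger values mean the coefficient standard errors are being inflated by the overlap.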
  1. Which of the models in Questions 3 and 4 is the best model? and why/how did you conclude this?
#Text-only answer
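## The model from 4b, which uses building_size alone as a predictor of cost, is the best of the three. Its Q-Q residual plot looks very similar to that of the Question 3 model, but it has the highest adjusted R-squared (0.4601 vs 0.3333 for Question 3 and 0.4196 for 4c), the lowest residual standard error (43.36) and the smallest overall p-value (0.003). Its coefficients are also easy to interpret: the intercept (213.47) is the mean cost of low square footage buildings, and adding the building_sizehigh coefficient (91.03) gives a mean of approximately 304.5 for high square footage buildings.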
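For reference, the fit statistics printed for the three models can be gathered into one table. This is only a sketch that retypes the numbers from the summaries in Questions 3 and 4 rather than refitting the models; with the fitted objects available, `AIC()` would be another criterion worth checking, which is an addition not used in the original answers.

```r
# Fit statistics copied from the three summary() outputs in Questions 3 and 4
model_fits <- data.frame(
  model         = c("cost ~ sqft", "cost ~ building_size", "cost ~ sqft + building_size"),
  adj_r_squared = c(0.3333, 0.4601, 0.4196),
  resid_se      = c(48.18, 43.36, 44.95),
  f_p_value     = c(0.01425, 0.003259, 0.01515),
  stringsAsFactors = FALSE
)

# Order from highest to lowest adjusted R-squared
model_fits <- model_fits[order(-model_fits$adj_r_squared), ]
model_fits

best_model <- model_fits$model[1]
best_model
```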

Question 5 (3 pts)

The Orings dataset from the Stat2Data package gives data on the damage that occurred in US Space Shuttle launches prior to the Challenger launch 33 years ago on January 28, 1986. The pre-launch charts for this mission that determined the go/no-go for launch contained only the data in rows 1,2,4,11,13,18.

  1. Load the Orings dataframe, B) subset the data to show only these rows,
library(Stat2Data)
data("Orings")
Orings_subset <- Orings[c(1,2,4,11,13,18),]
  1. plot Failures against temperature for the pre-launch data that NASA considered.
plot(Orings_subset$Temp, Orings_subset$Failures)

  1. Compare these data to the full dataset and describe your comparison.
plot(Orings$Temp, Orings$Failures)

## In the full dataset, the failure rate was much higher for launches below 65 degrees. That pattern should be read with some caution, because only a small proportion of the launches in the full dataset took place below 65 degrees, so it is possible that those launches were outliers whose failure rate would move closer to the norm as more launches accumulated. The subset only included launches above 65 degrees, so it does not provide any comparison across temperatures at all, let alone a meaningful one.

Question 6 (3pts)

  1. First, install and load the datasets package …
  2. Load the attitude dataset
# install.packages("datasets")
library(datasets)
data("attitude")
  1. Plot and interpret the relationship (using a linear model) between the overall rating and the number of complaints handled
plot(attitude$complaints, attitude$rating)

rating_complaints_lm <- lm(rating ~ complaints, data = attitude)
plot(rating_complaints_lm)

summary(rating_complaints_lm)
## 
## Call:
## lm(formula = rating ~ complaints, data = attitude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.8799  -5.9905   0.1783   6.2978   9.6294 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.37632    6.61999   2.172   0.0385 *  
## complaints   0.75461    0.09753   7.737 1.99e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.993 on 28 degrees of freedom
## Multiple R-squared:  0.6813, Adjusted R-squared:  0.6699 
## F-statistic: 59.86 on 1 and 28 DF,  p-value: 1.988e-08
## The p-value for the complaints coefficient (1.99e-08) indicates a strong positive relationship between complaints and rating: a one-unit increase in complaints is associated with an increase of about 0.75 in overall rating, and the model explains about 68% of the variance in rating. The scale-location plot does not indicate heteroscedasticity. The Q-Q residual plot shows light tails at both ends, meaning the most extreme residuals are somewhat smaller in magnitude than a normal distribution would predict. However, most of the points in that plot are close to the line, so this is likely not much of an issue and does not necessarily violate the assumption of normally distributed residuals.
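To attach some uncertainty to the slope, confidence intervals can be pulled from the fitted model. The sketch below refits the same model on the built-in `attitude` data; the complaints value of 70 used for the prediction is an arbitrary illustrative choice.

```r
# attitude ships with base R's datasets package
data("attitude", package = "datasets")

rating_complaints_lm <- lm(rating ~ complaints, data = attitude)

# 95% confidence intervals for the intercept and slope
ci <- confint(rating_complaints_lm, level = 0.95)
ci

# Predicted mean rating, with a 95% confidence interval,
# at a hypothetical complaints value of 70
pred <- predict(rating_complaints_lm,
                newdata = data.frame(complaints = 70),
                interval = "confidence")
pred
```

An interval for the slope that excludes 0 tells the same story as the small p-value, while also showing the range of plausible effect sizes.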

Question 7 (3pts)

For the dataset in question 6, Are the number of raises employees get associated with the overall rating (use a linear model)?

rating_raises_lm <- lm(rating ~ raises, data = attitude)
plot(rating_raises_lm)

summary(rating_raises_lm)
## 
## Call:
## lm(formula = rating ~ raises, data = attitude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.123  -7.109   1.377   5.741  19.568 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.9778    11.6881   1.709 0.098468 .  
## raises        0.6909     0.1786   3.868 0.000598 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10 on 28 degrees of freedom
## Multiple R-squared:  0.3483, Adjusted R-squared:  0.325 
## F-statistic: 14.96 on 1 and 28 DF,  p-value: 0.0005978
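## The P value for the raises coefficient (0.0006) indicates a statistically significant positive association between raises and ratings: a one-unit increase in raises is associated with an increase of about 0.69 in rating. The intercept's P-value (0.098) only means that the intercept is not distinguishable from 0 at the 0.05 level; it says nothing about the relationship itself. The Residuals vs Fitted plot supports a linear relationship, although raises explains only about 35% of the variance in rating.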

Question 8 (6pts)

For the models generated in questions 6 and 7, we will evaluate model fits and assumptions.

  1. Evaluate model fits and assumptions for the Question 6 model.
plot(rating_complaints_lm)

plot(fitted(rating_complaints_lm) ~ attitude$rating)

cor(fitted(rating_complaints_lm), attitude$rating)
## [1] 0.8254176
## The Residuals vs Fitted plot from the LM shows no evidence of a non-linear relationship. However, the Q-Q residual plot has tails at both ends, which indicates that the residuals deviate somewhat from a normal distribution. The correlation between fitted and observed ratings is 0.83.
  1. Evaluate model fits and assumptions for the Question 7 model.
plot(rating_raises_lm)

plot(fitted(rating_raises_lm) ~ attitude$rating)

cor(fitted(rating_raises_lm), attitude$rating)
## [1] 0.590139
## As with the Question 6 model, the Residuals vs Fitted plot shows no evidence of a non-linear relationship. The Q-Q Residuals plot does display a few outliers, but the points stay closer to the line and the tails are less pronounced than in the Question 6 model. This is evidence that the residuals more closely resemble a normal distribution. The correlation between fitted and observed ratings is lower, at 0.59.
  1. Which model appears to be a better fit to the data? and why?
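## Based on the Q-Q residuals plots alone, the Question 7 model predicting rating from raises meets the normality assumption slightly better, since its points deviate less from the straight dashed line. In terms of fit, however, the Question 6 model predicting rating from complaints is clearly better: its fitted values correlate more strongly with the observed ratings (0.83 vs 0.59), it explains more of the variance (R-squared 0.68 vs 0.35) and its residual standard error is smaller (6.99 vs 10). Overall, the Question 6 model appears to be the better fit to the data.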
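One way to put numbers beside the diagnostic plots is to tabulate a few fit statistics for both models. The sketch below refits the two models on the built-in `attitude` data; including `AIC()` is an addition for illustration and was not part of the original analysis.

```r
data("attitude", package = "datasets")

models <- list(
  complaints = lm(rating ~ complaints, data = attitude),
  raises     = lm(rating ~ raises, data = attitude)
)

# One row per model: R-squared, residual standard error, and AIC
fit_table <- data.frame(
  predictor = names(models),
  r_squared = sapply(models, function(m) summary(m)$r.squared),
  resid_se  = sapply(models, sigma),
  aic       = sapply(models, AIC),
  row.names = NULL
)
fit_table
```

Higher R-squared and lower residual standard error or AIC indicate a closer fit; normality of the residuals is a separate question that the Q-Q plots address.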

Question 9 (6pts)

The following data represent the total number of aberrant crypt foci (abnormal growths in the colon) observed in nine rats that were given a single dose of the carcinogen azoxymethane and examined after six weeks.

87 53 72 90 78 85 83 99 49

  1. Enter these data into a vector and B) calculate the sample mean and standard deviation.
crypt_foci <- c(87,53,72,90,78,85,83,99,49)
crypt_foci_mean <- mean(crypt_foci)
crypt_foci_mean
## [1] 77.33333
crypt_foci_sd <- sd(crypt_foci)
crypt_foci_sd
## [1] 16.72573
  1. set.seed() to 456 and take samples from a normal distribution of same length with means ranging from 5 higher to 50 higher than the mean of A in increments of 1 with the same standard deviation.
set.seed(456)
new_means_seq <- seq(crypt_foci_mean + 5, crypt_foci_mean + 50, by = 1)
crypt_foci_samples_df <- data.frame(new_means = new_means_seq, p_values = 0)

## T-tests for every new mean in the sequence
for (i in seq_along(new_means_seq)) {
  rnorm_dist <- rnorm(length(crypt_foci), new_means_seq[i], sd = crypt_foci_sd)
  crypt_foci_ttest <- t.test(crypt_foci, rnorm_dist)
  crypt_foci_samples_df$p_values[i] <- crypt_foci_ttest$p.value
}
  1. at which increment are they statistically different from the mean of the given data, is this consistent?, show a plot of the mean difference (x axis) and the p value (y axis) and draw a red line at the 0.05 level of significance.
crypt_foci_samples_df$mean_diff <- crypt_foci_samples_df$new_means - mean(crypt_foci)
plot(crypt_foci_samples_df$mean_diff, crypt_foci_samples_df$p_values)
abline(h = 0.05, col = 'red')

## The mean differences become statistically significant at increment 14, but do not become consistently significant until increment 25.
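Because each increment was sampled only once, where the p-values first cross 0.05 depends on the random draw. A rough way to probe consistency is to repeat the sampling many times at every increment and record how often the t-test is significant. The sketch below does this; the number of repeats (`n_sims`) and the 0.8 reference line are assumptions chosen for illustration.

```r
crypt_foci <- c(87, 53, 72, 90, 78, 85, 83, 99, 49)
set.seed(456)

mean_diffs <- 5:50
n_sims <- 200  # hypothetical number of repeated samples per increment

# Share of repeated samples at each mean difference with p < 0.05
prop_significant <- sapply(mean_diffs, function(d) {
  p_values <- replicate(n_sims, {
    sim_sample <- rnorm(length(crypt_foci),
                        mean = mean(crypt_foci) + d,
                        sd = sd(crypt_foci))
    t.test(crypt_foci, sim_sample)$p.value
  })
  mean(p_values < 0.05)
})

plot(mean_diffs, prop_significant, type = "b",
     xlab = "Mean difference",
     ylab = "Proportion of samples with p < 0.05")
abline(h = 0.8, col = "red")  # 0.8 is a conventional threshold, assumed here
```

Increments where the proportion is well below 1 are ones where a single draw could land on either side of the 0.05 line, which is why a single run can look inconsistent.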

Question 10 (5pts)

Graduate students / Undergrad Bonus Points

Read the two papers in the Lab 1 folder by Maass et al. (2007) and Beck et al. (2014). Recognizing that these are very different topics and research, discuss the similarities in terms of the influence of bias on scientific results using examples from the papers. Limit your answer to approximately 300 words.

#Text-only, ~300 words
## Beck et al. (2014) used expert evaluation and Maass et al. (2007) did not, yet in both papers the presence or absence of expertise shaped the results and conclusions in similar ways. In Maass et al., expert opinions were not used to evaluate the beauty or quality of the soccer goals shown to study participants. However, one of the conclusions by Maass et al. was that showing soccer goals to non-experts likely resulted in lower spatial effects in the results, and that showing the same goals to soccer experts would likely result in more variance in how “beautiful” a goal was, because soccer experts are more attuned to what distinguishes one soccer goal from another. The Beck et al. (2014) paper included expert analysis of their SDM models and found that Maxent models with fewer background points were graded better by the expert than models with many background points. However, this inclusion of expert analysis could potentially be subject to bias on its own. The authors found that when using fewer background points, their AUC_Maxent reflected a decrease in model quality, an opposite relationship to the trend in expert grading. The goal of using expert assessment to measure model quality in the Beck et al. paper was to discover a way to minimize the spatial bias of GBIF and other similar distributional databases. However, expert assessment and model accuracy measurements such as AUC disagreed over whether this occurred. The authors concluded that this was a result of AUC being affected by data bias, which makes it unreliable as an indicator of model quality. One question that does not get answered by the Beck et al. paper is whether the expert analysis used to evaluate model quality could potentially be subject to similar spatial bias as detected in the Maass et al. paper on the subjective beauty of soccer goals and violence of movie scenes.